Abstract

Numerous approaches in image processing and com- 
puter vision are making use of super-pixels as a pre- 
processing step. Among the different methods producing 
such over-segmentation of an image, the graph-based ap- 
proach of Felzenszwalb and Huttenlocher is broadly em- 
ployed. One of its interesting properties is that the regions 
are computed in a greedy manner in quasi-linear time. The
algorithm may be trivially extended to video segmentation
by considering a video as a 3D volume; however, this is not
possible for causal segmentation, where subsequent frames
are unknown. We propose an efficient video segmentation
approach that computes temporally consistent super-pixels
in a causal manner, filling the need for causal and real-time
applications.

1. Introduction 

A segmentation of video into consistent spatio-temporal 
segments is a largely unsolved problem. While there have 
been attempts at video segmentation, most methods are non 
causal and non real-time. This paper proposes a fast method 
for real time video segmentation, including semantic seg- 
mentation as an application. 

A large number of approaches in computer vision
make use of super-pixels at some point in the process, for
example semantic segmentation [ ], geometric context
identification [ ], and extraction of support relations between
objects in scenes [14]. Among the most popular approaches
for super-pixel segmentation, two types of methods
are distinguishable. Regularly shaped super-pixels may be
produced using normalized cuts [ ], for instance. Super-pixels
shaped more like objects, or parts of objects, can be generated
from watershed-based approaches. In particular, the method
of Felzenszwalb and Huttenlocher [ ] produces such results.

It is a real challenge to obtain a decent delineation of
objects from a single image. When it comes to real-time
data analysis, the problem is even more difficult. However,
additional cues can be used to constrain the solution
to be temporally consistent, thus helping to achieve better
results. Since many of the underlying algorithms are in general
super-linear, there is often a need to reduce the
dimensionality of the video. To this end, developing low-level
vision methods for video segmentation is necessary. Currently,
most video processing approaches are non-causal,
that is to say, they make use of future frames to segment a
given frame, sometimes requiring the entire video [ ]. This
prevents their use for real-time applications.

Some approaches have been designed to address the
causal video segmentation problem [15, 13]. [15] makes use
of the mean shift method [ ]. As this method works in a feature
space, it does not necessarily cluster spatially consistent
super-pixels. A more recent approach, specifically applied
to semantic segmentation, is the one of Miksik et al. [13].
The work of [13] employs an optical flow method to
enforce the temporal consistency of the semantic segmentation.
Our approach is different because it aims to produce
super-pixels, and possibly uses the produced super-pixels
for smoothing semantic segmentation results. Furthermore,
we do not use any optical flow pre-computation, which would
prevent us from achieving real-time performance on a CPU.

Some works use the idea of enforcing some consistency
between different segmentations [7, 12, 11, 8]. [ ]
formulates a co-clustering problem as a Quadratic
Semi-Assignment Problem. However, solving the problem for a
pair of images takes about a minute. Alternatively, [12] and
[ ] identify the corresponding regions using graph matching
techniques. The approach of [8] is only illustrated on
very coarse image segmentations, and the graph matching is
performed between graphs that contain about a dozen regions.
In both cases, the number of tracked regions is limited in
the experiments to a small number.

The idea developed in this paper is to perform independent
segmentations and match the produced super-pixels to
define markers. The markers are then used to produce the
final segmentation by minimizing a global criterion defined
on the image. The graph used in the independent segmentation
part is reused in the final segmentation stage, thus
leading to gains in speed and real-time performance on a
single-core CPU.

2. Method 

Given a segmentation $S_t$ of an image at time $t$, we wish
to compute a segmentation $S_{t+1}$ of the image at time $t + 1$
which is consistent with the segments of the result at time $t$.

2.1. Independent image segmentation 

The super-pixels produced by [6] have been shown to
satisfy the global properties of being not too coarse and not
too fine according to a particular region comparison
function. In order to generate super-pixels close to the ones
produced by [ ], we first generate independent segmentations
of the 2D images using [ ]. We name these segmentations
$S'_1, \ldots, S'_t$. The principle of segmentation is fairly simple.
We define a graph $G_t$, where the nodes correspond to the
image pixels, and the edges link neighboring nodes in
8-connectivity. The edge weights $\omega_{ij}$ between nodes $i$ and $j$
are given by a color gradient of the image.

A Minimum Spanning Tree (MST) is built on $G_t$, and a
region merging criterion is defined. Specifically, two regions
$X$ and $Y$ are merged when:

$$\mathrm{Diff}(X, Y) < \min\left(\mathrm{Int}(X) + \frac{k}{|X|},\ \mathrm{Int}(Y) + \frac{k}{|Y|}\right), \qquad (1)$$

where $|X|$ is the number of pixels of region $X$ and $k$ is a
parameter that prevents the merging of
large regions. The internal difference $\mathrm{Int}(X)$ of a region
$X$ is the highest weight of an edge linking two vertices of
$X$ in the MST. The difference $\mathrm{Diff}(X, Y)$ between two
neighboring regions $X$ and $Y$ is the smallest weight of an
edge that links $X$ to $Y$.
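
The merging rule of Equation (1) can be illustrated with a short union-find sketch. This is not the authors' C/C++ implementation: it assumes the greedy edge scan of Felzenszwalb and Huttenlocher with the threshold term k/|X|, and the function name `segment_graph` and the `(weight, i, j)` edge format are hypothetical choices made for the example.

```python
def segment_graph(num_nodes, edges, k):
    """Greedy region merging in the spirit of Equation (1).

    edges: list of (weight, i, j) tuples (hypothetical format).
    Returns a list mapping each node to the representative of its region.
    """
    parent = list(range(num_nodes))
    size = [1] * num_nodes
    internal = [0.0] * num_nodes  # Int(X): largest MST edge weight inside X

    def find(i):
        # Union-find lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Scanning edges by increasing weight builds the MST implicitly.
    for w, i, j in sorted(edges):
        a, b = find(i), find(j)
        if a == b:
            continue
        # Merge when the linking edge is light compared with both regions.
        if w < min(internal[a] + k / size[a], internal[b] + k / size[b]):
            parent[b] = a
            size[a] += size[b]
            internal[a] = w  # w is the largest weight merged so far
    return [find(i) for i in range(num_nodes)]


# Four pixels in a row: two similar pairs separated by a strong gradient.
labels = segment_graph(4, [(1.0, 0, 1), (1.0, 2, 3), (10.0, 1, 2)], k=2)
```

With a small `k` the strong edge is kept as a boundary, while a large `k` lets the two pairs merge into one region, which is the role the text assigns to the scale parameter.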

Once an image is independently segmented, resulting in
$S'_{t+1}$, we then face the question of the propagation of the
temporal consistency given the non-overlapping contours of
$S_t$ and $S'_{t+1}$.

Our solution is the development of a cheap graph matching
technique to obtain correspondences between segments
from $S_t$ and those of $S'_{t+1}$. This first step is described in
Section 2.2. We then mine these correspondences to create
markers (also called seeds) to compute the final labeling
$S_{t+1}$ by solving a global optimization problem. This second
step is detailed in Section 2.3.

2.2. Graph matching procedure 

The basic idea is to use the segmentation $S_t$ and the
segmentation $S'_{t+1}$ to produce markers before a final
segmentation of the image at time $t + 1$. Therefore, in the process of
computing a new segmentation $S_{t+1}$, a graph $G$ is defined.
The vertices of $G$ comprise two sets of vertices: $V_t$, which
corresponds to the set of regions of $S_t$, and $V'_{t+1}$, which
corresponds to the set of regions of $S'_{t+1}$. Edges link regions
characterised by a small distance between their centroids.

Figure 1. Illustration of the graph matching procedure.

The edge weights between vertices $i \in V_t$ and $j \in V'_{t+1}$ are
given by a similarity measure taking into account distance
and differences between shape and appearance:

$$w_{ij} = \frac{(|r_i| + |r_j|)\, d(c_i, c_j)}{|r_i \cap r_j|} + a_{ij}, \qquad (2)$$

where $|r_i|$ denotes the number of pixels of region $r_i$,
$|r_i \cap r_j|$ the number of pixels present in $r_i$ and $r_j$ with
aligned centroids, $d(c_i, c_j)$ the distance between the centroids
$c_i$ and $c_j$, and $a_{ij}$ the appearance difference of regions
$r_i$ and $r_j$. In our experiments, $a_{ij}$ was defined as the
difference between mean color intensities of the regions.

The graph matching procedure is illustrated in Figure 1
and produces the following result. For each region of $S'_{t+1}$,
its best corresponding region in image $S_t$ is identified. More
specifically, each node $j$ of $V'_{t+1}$ is associated with the node
$i$ of $V_t$ which minimizes $w_{ij}$. Symmetrically, for each
region of $S_t$, its best corresponding region in image $S'_{t+1}$ is
identified, that is to say each node $i$ of $V_t$ is associated
with the node $j$ of $V'_{t+1}$ which minimizes $w_{ij}$.
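
The two-way matching described above can be sketched as follows. This is an illustration only. It assumes an edge weight of the form (sum of region sizes × centroid distance) / (overlap after centroid alignment) + appearance difference. The region format (a dict with a pixel set and a mean intensity), the integer rounding of the alignment shift, the `max_dist` threshold for creating edges, and the infinite weight for empty overlaps are all hypothetical choices.

```python
import math


def centroid(pixels):
    n = len(pixels)
    return (sum(y for y, _ in pixels) / n, sum(x for _, x in pixels) / n)


def weight(r_i, r_j):
    """Matching cost of two regions; lower means more similar.

    Each region is {"pixels": set of (y, x), "color": mean intensity}.
    """
    c_i, c_j = centroid(r_i["pixels"]), centroid(r_j["pixels"])
    d = math.dist(c_i, c_j)
    # Translate r_j so that its centroid coincides with that of r_i.
    dy, dx = round(c_i[0] - c_j[0]), round(c_i[1] - c_j[1])
    shifted = {(y + dy, x + dx) for y, x in r_j["pixels"]}
    overlap = len(r_i["pixels"] & shifted)
    if overlap == 0:
        return math.inf  # assumption: shapes that never overlap cannot match
    size_sum = len(r_i["pixels"]) + len(r_j["pixels"])
    return size_sum * d / overlap + abs(r_i["color"] - r_j["color"])


def match(regions_t, regions_t1, max_dist):
    """Return (best_for_t, best_for_t1): minimum-weight partner per region."""
    weights = {}
    for i, r_i in enumerate(regions_t):
        for j, r_j in enumerate(regions_t1):
            # Edges only link regions whose centroids are close.
            gap = math.dist(centroid(r_i["pixels"]), centroid(r_j["pixels"]))
            if gap <= max_dist:
                weights[i, j] = weight(r_i, r_j)
    best_for_t, best_for_t1 = {}, {}
    for (i, j), w in weights.items():
        if i not in best_for_t or w < weights[i, best_for_t[i]]:
            best_for_t[i] = j
        if j not in best_for_t1 or w < weights[best_for_t1[j], j]:
            best_for_t1[j] = i
    return best_for_t, best_for_t1
```

Keeping both directions of the matching is what later allows the four seeding cases to be distinguished, for example a region with exactly one partner versus a region claimed by several partners.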

2.3. Final segmentation procedure 

The final segmentation $S_{t+1}$ is computed using a minimum
spanning forest procedure. This seeded segmentation
algorithm, which produces watershed cuts [ ], is strongly
linked to global energy optimization methods such as
graph cuts [1, ], as detailed in Section 2.4. In addition to theoretical
guarantees of optimality, this choice of algorithm is motivated
by the opportunity to reuse the sorting of edges that
is performed in Section 2.1 and constitutes the main computational
effort. Consequently, we reuse here the graph $G_{t+1}(V, E)$
built for the production of the independent segmentation $S'_{t+1}$.

The minimum spanning forest algorithm is recalled in
Algorithm 1.

Algorithm 1: Minimum Spanning Forest algorithm

Data: A weighted graph $G(V, E)$ and a set of labeled
      nodes (markers) $L$. Nodes of $V \setminus L$ have unknown
      labels initially.
Result: A labeling $x$ associating a label to each vertex.
Sort the edges of $E$ by increasing order of weight.
while any node has an unknown label do
    Find the edge $e_{ij}$ in $E$ of minimal weight;
    if $v_i$ or $v_j$ have unknown label values then
        Merge $v_i$ and $v_j$ into a single node, such that
        when the value for this merged node becomes
        known, all merged nodes are assigned the
        same value of $x$ and considered known.

The seeds, or markers, are defined using the
region correspondences computed in the previous section,
according to the procedure detailed below. For each segment
$s'$ of $S'_{t+1}$, four cases may appear:

1. $s'$ has one and only one matching region $s$ in $S_t$: propagate
   the label $l_s$ of region $s$. All nodes of $s'$ are labeled
   with the label $l_s$ of region $s$.

2. $s'$ has several corresponding regions $s_1, \ldots, s_r$: propagate
   seeds from $S_t$. The coordinates of regions
   $s_1, \ldots, s_r$ are centered on region $s'$. The labels of regions
   $s_1, \ldots, s_r$ whose coordinates are in the range of
   $s'$ are propagated to the nodes of $s'$.

3. $s'$ has no matching region: the region is labeled by
   its own label $l_{s'}$.

4. If none of the previous cases is fulfilled, it means that
   $s'$ is part of a larger region $s$ in $S_t$. If the size of $s'$ is
   small, a new label is created. Otherwise, the label $l_s$ is
   propagated in $s'$ as in case 1.

Before applying the minimum spanning forest algorithm,
a safety test is performed to check that the map of produced
markers does not differ too much from the segmentation
$S'_{t+1}$. If the test shows large differences, an eroded map of
$S'_{t+1}$ is used to correct the markers.
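
A minimal Python sketch of the seeded minimum spanning forest of Algorithm 1 is given below, assuming a Kruskal-style scan with a union-find structure. The function name, the `(weight, i, j)` edge format and the `seeds` dictionary are hypothetical, and the sketch is not the authors' implementation.

```python
def minimum_spanning_forest(num_nodes, edges, seeds):
    """Propagate seed labels along a minimum spanning forest.

    edges: list of (weight, i, j) tuples; seeds: dict node -> label.
    Returns one label per node (None if a node cannot reach any seed).
    """
    parent = list(range(num_nodes))
    # The label of a merged group is stored at its representative.
    label = [seeds.get(v) for v in range(num_nodes)]

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for _, i, j in sorted(edges):
        a, b = find(i), find(j)
        if a == b:
            continue
        if label[a] is not None and label[b] is not None:
            continue  # both sides are already labeled: the edge is cut
        # At least one side is unknown: merge and keep the known label.
        parent[b] = a
        if label[a] is None:
            label[a] = label[b]
    return [label[find(v)] for v in range(num_nodes)]


# A chain of four nodes with seeds at both ends and a heavy middle edge.
x = minimum_spanning_forest(
    4, [(1, 0, 1), (5, 1, 2), (1, 2, 3)], seeds={0: "a", 3: "b"}
)
```

In this sketch the call to `sorted` dominates the cost. If the edges were passed in already sorted, as the reuse of the graph from Section 2.1 suggests, that step could be skipped.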

2.4. Global optimization guarantees

Several graph-based segmentation problems, including
minimum spanning forests, graph cuts, random walks and
shortest paths, have recently been shown to belong to a
common energy minimization framework [ ]. The considered
problem is to find a labeling $x^* \in \mathbb{R}^{|V|}$ defined on the nodes
of a graph that minimizes

$$E(x) = \sum_{e_{ij} \in E} w_{ij}^p\, |x_j - x_i|^q + \sum_{v_i \in V} w_i^p\, |l_i - x_i|^q, \qquad (3)$$

where $l$ represents a given configuration and $x$
represents the target configuration. The result of
$\lim_{p \to \infty} \arg\min_x E(x)$ for values of $q > 1$ always
produces a cut by maximum (equivalently minimum) spanning
forest. The converse is also true if the weights of the graph
are all different.

Figure 2. Segmentation results on 3 consecutive frames of the
NYU-Scene dataset.

In the case of our application, the pairwise weights $w_{ij}$ are
given by an inverse function of the original weights $\omega_{ij}$. The
pairwise term thus penalizes any unwanted high-frequency
content in $x$ and essentially forces $x$ to vary smoothly within
an object, while allowing large changes across the object
boundaries. The second term enforces fidelity of $x$ to a
specified configuration $l$, $w_i$ being the unary weights
enforcing that fidelity.

The enforcement of markers $l_s$ as hard constraints may
be viewed as follows: a node of label $l_s$ is added to the
graph, and linked to all nodes $i$ of $V$ that are supposed to be
marked. The unary weights $\omega_{l_s}$ are set to arbitrarily large
values in order to impose the markers.
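
To make the two terms of the energy concrete, the following sketch evaluates a pairwise-plus-unary energy of the form of Equation (3) on a toy graph. It assumes the weights are raised to the power p and the differences to the power q. The function name and the dictionary-based graph format are hypothetical, and the sketch only evaluates the energy of a given labeling rather than minimizing it.

```python
def energy(x, target, pairwise, unary, p, q):
    """Evaluate a pairwise smoothness term plus a unary fidelity term.

    x: candidate labeling (list of numbers, one per node).
    target: configuration l to which x should stay faithful.
    pairwise: dict (i, j) -> weight w_ij; unary: dict i -> weight w_i.
    """
    smooth = sum(
        w ** p * abs(x[j] - x[i]) ** q for (i, j), w in pairwise.items()
    )
    fidelity = sum(
        w ** p * abs(target[i] - x[i]) ** q for i, w in unary.items()
    )
    return smooth + fidelity


# Three nodes in a chain; nodes 0 and 2 are marked with labels 0 and 1.
pairwise = {(0, 1): 3.0, (1, 2): 1.0}  # high weight = similar pixels
unary = {0: 10.0, 2: 10.0}             # large weights impose the markers
target = [0, 0, 1]

cut_weak = energy([0, 0, 1], target, pairwise, unary, p=2, q=2)    # 1.0
cut_strong = energy([0, 1, 1], target, pairwise, unary, p=2, q=2)  # 9.0
```

Placing the label change on the weakly weighted edge gives the lower energy, which matches the description of a pairwise term that allows large changes across object boundaries while the large unary weights keep the marked nodes fixed.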

2.5. Applications to optical flow and semantic segmentation

An optical flow map may be easily estimated from two
successive segmentations $S_t$ and $S_{t+1}$. For each region $r$ of
$S_{t+1}$, if the label of $r$ comes from a label present in a region
$s$ of the segmentation $S_t$, the optical flow in $r$ is computed
as the distance between the centroid of $r$ and the centroid of
$s$. The optical flow map may be used as a sanity check for
region tracking applications. In principle, a video sequence
will not contain displacements of objects greater than a certain
value.

Figure 3. Comparison with the mean-shift segmentation method of Paris [ ] on Frames 19 and 20. $k = 200$, $\delta = 400$, $\sigma = 0.5$.

For each super-pixel $s$ of $S_{t+1}$, if the label of region $s$
comes from the previous segmentation $S_t$, then the semantic
prediction from $S_t$ is propagated to $S_{t+1}$. Otherwise, in
case the label of $s$ is a new label, the semantic prediction is
computed using the prediction at time $t + 1$. As some errors
may appear in the region tracking, the labels of regions that
have inconsistent large values in optical flow maps are not
propagated.
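
The region-level flow and the propagation rule above can be sketched in a few lines. This is a hedged illustration: the dictionaries keyed by region label, the function names, and the `max_disp` displacement threshold are hypothetical, since the text does not state how the sanity check is parameterized.

```python
import math


def region_flow(centroids_t, centroids_t1):
    """Displacement (dy, dx) of every label present in both frames.

    centroids_*: dict label -> (y, x) centroid of the region.
    """
    return {
        lab: (c1[0] - centroids_t[lab][0], c1[1] - centroids_t[lab][1])
        for lab, c1 in centroids_t1.items()
        if lab in centroids_t
    }


def propagate_semantics(flow, semantics_t, semantics_t1, max_disp):
    """Choose a semantic class for each region of frame t+1.

    A tracked region with a plausible displacement keeps its class from
    frame t; new or implausibly fast regions use the frame t+1 prediction.
    """
    out = {}
    for lab, pred in semantics_t1.items():
        v = flow.get(lab)
        tracked = v is not None and lab in semantics_t
        if tracked and math.hypot(*v) <= max_disp:
            out[lab] = semantics_t[lab]
        else:
            out[lab] = pred
    return out


flow = region_flow(
    {1: (10.0, 10.0), 2: (50.0, 50.0)},
    {1: (10.0, 13.0), 2: (50.0, 150.0), 3: (5.0, 5.0)},
)
classes = propagate_semantics(
    flow,
    semantics_t={1: "road", 2: "tree"},
    semantics_t1={1: "car", 2: "sky", 3: "building"},
    max_disp=20.0,
)
```

Here region 1 keeps its class from frame t, region 2 is rejected because its displacement is implausibly large, and region 3 is new, so both fall back to the per-frame prediction.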

For the specific task of semantic segmentation, results
can be improved by exploiting the contours of the recognized
objects. Semantic contours, such as the transition
between a building and a tree, might not
be present in the gradient of the raw image. Thus, in addition
to the pairwise weights $\omega_{ij}$ described in Section 2.1, we
add a constant in the presence of a semantic contour.

3. Results 

We now demonstrate the efficiency and versatility of our 
approach by applying it to different problems: simple super- 
pixel segmentation, semantic scene labeling, and optical 
flow. 

Following the implementation of [ ], we pre-process the
images using a Gaussian filtering step with a kernel of
variance $\sigma$. A post-processing step that removes
regions of small size, that is to say below a threshold $\delta$, is
also performed. As in [ ], we denote the scale of observation
parameter by $k$.

3.1. Super-pixel segmentation 

Experiments are performed on two different types of
videos: videos where the camera is static, and videos where
the camera is moving. The robustness of our approach to
large variations in the region sizes and large movements of
the camera is illustrated in Figure 2.

A comparison with the temporal mean shift segmentation
of Paris [ ] is displayed in Figure 3. The super-pixels
produced by [ ] are not spatially consistent, as
the segmentation is performed in the feature (color) space
in their case. Our approach is slower, although still suitable for
real-time applications, but it computes only spatially consistent
super-pixels.

3.2. Semantic scene labeling 

We suppose that we are given a noisy semantic labeling 
for each frame. In this work we used the semantic predic- 
tions of [5]. 

We compare our results with the results of [ ] on the
NYU-Scene dataset. The dataset consists of a video sequence
of 73 frames provided with a dense semantic labeling
and ground truths. As the provided dense labeling was
performed with no temporal exploitation, it suffers from
sudden large object appearances and disappearances. As
illustrated in Figure 4, our approach reduces this effect and
improves the classification performance by more than 5%,
as reported in Table 1. We also display results on another
video in Figure 5.

3.3. Optical flow 

As detailed in Section 2.5, we compute the optical flow
maps between subsequent frames of videos. An example
result is shown in Figure 5, which illustrates the accurate
detection of a large move of the camera between Frames 9
and 10 of a video taken on a highway.

Figure 4. Comparison with the temporal smoothing method of [ ]. (a) Independent segmentations with no temporal smoothing. (b) Result using the temporal smoothing method of [ ]. (c) Our temporally consistent segmentation. Parameters used: $k = 1200$, $\delta = 100$, $\sigma = 1.2$.





                 Frame by frame    Miksik et al. [13]    Our method
Accuracy (%)     71.11             75.31                 76.27
Frames/sec       -                 1.33*                 10.5

Table 1. Overall pixel accuracy (%) for the semantic segmentation
task on the NYU-Scene video. *Note that the reported timing does
not take into account the optical flow computation needed by [13].



3.4. Computation time 

The experiments were performed on a laptop with a 2.3
GHz Intel Core i7-3615QM and 8 GB of 1600 MHz DDR3
memory. Our method is implemented on CPU only, in C/C++,
and makes use of only one core of the processor. Super-pixel
segmentations take 0.1 seconds per image of size
320 x 240 and 0.4 seconds per image of size 640 x 380,
thus demonstrating the scalability of the pipeline. All
computations are included in the reported timings.

The timings of the temporal smoothing method of Miksik
et al. [13] are reported in Table 1. We note that the
processor used for the reported timings of [13] has similar
characteristics to ours. Furthermore, Miksik et al. use an
optical flow procedure that takes only 0.02 seconds per frame
when implemented on GPU, but takes seconds on CPU. Our
approach is thus better suited to real-time applications, for
instance on embedded devices where a GPU is often not
available.



4. Conclusion 

The proposed approach employs a graph matching technique
to produce markers used in a global optimization procedure
for video segmentation. Unlike many video segmentation
techniques, our algorithm is causal, which is a required
property for real-time applications, and does not
require any computation of optical flow. Our experiments
on challenging videos show that the obtained super-pixels
are robust to large camera or object displacements. Their
use in semantic segmentation applications demonstrates that
significant gains can be achieved and leads to state-of-the-art
results. Furthermore, by being 8 times faster than the
competing method for temporal smoothing of semantic
segmentation, and up to 25 times faster if a GPU is not
available, the proposed approach is of practical
interest in itself.

Figure 5. By matching regions, our method can be used to derive an optical flow (e). Panels: (a) Frame 9; (b) Frame 10; (c) segmentation $S_9$; (d) segmentation $S_{10}$; (e) optical flow; (f) legend; (g) frame-by-frame semantic segmentation; (h) our semantic segmentation. Parameters used: $k = 200$, $\delta = 400$, $\sigma = 0.81$.